Added specification for OpenTelemetry metric shipping #742
Conversation
specs/agents/metrics-otel.md
Outdated

> OpenTelemetry metric names are allowed to contain dots. This can lead to mapping problems in Elasticsearch if, for example, metrics with the names `foo.bar` and `foo` are both ingested.
> Due to the nested object representation within metricsets, `foo` would need to be mapped as both an object and a number at the same time, which is not possible.
>
> To avoid such conflicts, agents MUST replace all dots in metric names with `_` upon export iff the `dedot_custom_metrics` configuration option is set to `true`:
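To make the rule concrete, a minimal sketch of the dedotting transform described above (the function name is hypothetical, not part of any agent's API):

```js
// Illustrative sketch of the dedotting rule from the spec text: replace
// every dot in a metric name with an underscore. `dedotMetricName` is a
// hypothetical helper name, not an actual agent function.
function dedotMetricName(name) {
  return name.replace(/\./g, '_');
}

console.log(dedotMetricName('foo.bar')); // -> "foo_bar"
console.log(dedotMetricName('jvm.memory.heap.used')); // -> "jvm_memory_heap_used"
```

With this transform applied, `foo.bar` becomes `foo_bar` and no longer collides with a separate `foo` metric in the Elasticsearch mapping.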
The current spec text here is supposed to act more as a starting point for a discussion.

The Java agent already has the `dedot_custom_metrics` configuration option, which is on by default. It currently only affects Micrometer metrics, as that is the only way of collecting custom metrics at the moment.
On the other side, the APM Server does not dedot metric names when receiving metrics via OTLP.
This means that with the current proposal, the transition from sending metrics via the OTLP exporter to shipping them via the agent would not be seamless: because `dedot_custom_metrics` is on by default, the metric names would change.
There are some alternatives to consider:

- Add a `dedot_custom_metrics` option to the APM Server and make it on by default (breaking change).
- Change the default of `dedot_custom_metrics` in the Java agent to disabled (breaking change).
- Split `dedot_custom_metrics` in the Java agent into two different config options, one for OTel and one for Micrometer. Terrible idea imo, because I don't see a use case where these two would be configured differently.

Please use this discussion to vote for what you think is best or to propose alternative solutions.
Related: https://github.com/elastic/apm-dev/issues/856 (internal)
After some more thinking and offline discussions, what about the following "optimistic" solution:

In the mid to long term, we want to get rid of dedotting entirely by mapping metricsets in Elasticsearch using `subobjects: false`. Until we get there, we should aim to minimize breaking changes.

So for all agents which do not have the `dedot_custom_metrics` config yet (= everyone but Java):

- Agents MAY add the `dedot_custom_metrics` configuration option, but don't need to (it can be added later on demand if issues arise for users).
- If `dedot_custom_metrics` is added, the default MUST be `false` -> no breaking change is required later on when we fully support dots in names.

For all agents which already have the `dedot_custom_metrics` configuration (= the Java agent):

- Keep the `dedot_custom_metrics` configuration with its default as is and respect it for OTel metrics.
- As soon as we fully support dots in metric names, switch the default to `false` (a breaking but mitigable change).
Opinions?
SGTM, but proposing a slight variation that minimizes breaking changes in the future:

- Ignore the `dedot_custom_metrics` flag for OTel metrics in the Java agent and don't implement it in other agents. As this can lead to data loss, we probably can't GA the feature (debatable, as we did GA OTLP metrics intake).
- Work on support for `subobjects: false`: we could consider flattening the metric documents in APM Server, try getting elastic/elasticsearch#88934 prioritized, or contribute this feature to ES (👨🚀⏳?).

Then we only need to introduce the breaking change of changing the default of `dedot_custom_metrics` for Micrometer. I suppose that we don't really need the `dedot_custom_metrics` flag at that point anymore, so we can deprecate it when we change the default to `false` and eventually remove it.
> we probably can't GA the feature (debatable as we did GA OTLP metrics intake)

Because we create separate metricsets for the OTel data instead of merging them with e.g. the agent-collected JVM metricsets, I think the mapping conflicts would not be too bad, at least not worse than for the OTLP intake. E.g. we would only lose OTel metrics data; all other metrics would continue to work normally. So I'm not sure whether this really is a GA blocker.
specs/agents/metrics-otel.md
Outdated

> | `foo_bar` | `baz` | `string` |
> | `testnum_` | `42.42` | `number` |
>
> The OpenTelemetry specification allows the definition of metrics with the same name as long as they reside within a different instrumentation scope ([spec link](https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter)). In order to avoid conflicts, agents MUST add a mandatory label to the metricset with the key `otel_instrumentation_scope_name` and the corresponding instrumentation scope name as the label value.
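As a hedged sketch of the proposed labeling (the label-object shape and helper name below are illustrative, not the intake-v2 wire format):

```js
// Illustrative only: attach the proposed mandatory label identifying the
// instrumentation scope to a metricset's labels. `withScopeLabel` is a
// hypothetical helper, not taken from any agent implementation.
function withScopeLabel(scopeName, labels) {
  return Object.assign({}, labels, {
    otel_instrumentation_scope_name: scopeName,
  });
}

// Two metrics named "my_counter" from different scopes stay distinguishable:
const a = withScopeLabel('my-meter', {});
const b = withScopeLabel('my-meter2', {});
console.log(a.otel_instrumentation_scope_name); // -> "my-meter"
console.log(b.otel_instrumentation_scope_name); // -> "my-meter2"
```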
This is another difference to how the APM Server currently does things when receiving metric data via OTLP. The APM Server will create a different metricset per instrumentation scope, but won't add any label to distinguish them (to my knowledge).
This means that when displaying the metrics in Kibana, metrics with the same name will get mixed up even if they come from different instrumentation scopes.
Would be great if @elastic/apm-server could clarify here whether I'm missing something or whether my proposed label makes sense and should maybe be added on the OTLP intake side as well.
Some notes on this.

- https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from those that the instrumentation scope `version` (and maybe `schema_url`) fields may also be relevant for uniquely identifying a scope.
- FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two `my_counter` metrics with different instrumentation scopes:
```js
// ... (SDK and Prometheus exporter setup elided)
const meter = otel.metrics.getMeter('my-meter') // 'my-meter' is the instrumentation scope name
const counter = meter.createCounter('my_counter', { description: 'My Counter' })

const meter2 = otel.metrics.getMeter('my-meter2')
const counter2 = meter2.createCounter('my_counter', { description: 'My Counter 2' })

setInterval(() => {
  counter.add(1)
  counter2.add(1)
}, 1000)
```
The resulting exported Prometheus metrics:

```
% curl localhost:3002/metrics
# HELP target_info Target metadata
# TYPE target_info gauge
target_info{service_name="use-otel-prom",telemetry_sdk_language="nodejs",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.9.1"} 1
# HELP my_counter_total My Counter
# TYPE my_counter_total counter
my_counter_total 317 1679701732718
# HELP my_counter_total My Counter 2
# TYPE my_counter_total counter
my_counter_total 317 1679701732718
```
So I wonder if this is a conflict we could come back to if/when there is a real use case.

- If we were to add an intake field or label for this, https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/mapping-to-non-otlp.md#instrumentationscope describes `otel.scope.name` and `otel.scope.version`. I'm not sure if that applies here.
> https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from those that the instrumentation scope version (and maybe schema_url) fields may also be relevant for uniquely identifying a scope.

It is true that the instrumentation scope name alone might not be good enough to prevent clashes in all cases. It is good enough to prevent cross-library clashes (e.g. libraries A and B happen to export a metric with the same name), but won't handle the case where you have multiple versions of the same library with breaking changes in their metrics.
My thought here was to start simple and leave the option open to maybe just use a single label (name and version concatenated) instead of adding both immediately.

> FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two my_counter metrics with diff instrumentation scopes

> So, I wonder if this is the conflict we could later come back if/when there is a real use case.

Interesting observation that this is currently not even handled by the OTel Prometheus exporter.
We can definitely just ignore the `InstrumentationScope` for now and add a label later on, as it would be a non-breaking change.
We should, however, imo keep in the spec that instrumentation scopes have to use different metricsets, even if they have the same labels. Otherwise we would need to decide on a "winner" metric when we encounter a clash.
In the meeting today we discussed using `otel.scope.name` and `otel.scope.version` and perhaps adding those as top-level fields (@gregkalapos volunteered).
@Mpdreamz mentioned (if I understood correctly) that Metricbeat uses the ECS `observer.name` field (https://www.elastic.co/guide/en/ecs/current/ecs-observer.html#field-observer-name) for a similar purpose?
I've left the concrete addition of the `otel.scope.name` and `otel.scope.version` fields for the future, so that this doesn't block merging this spec:

> Agents MUST report metrics from different instrumentation scopes in separate metricsets to avoid naming conflicts at collection time. This separation MUST be done based on the instrumentation scope name and version. In the future, we might add dedicated intake fields to metricsets for differentiating them based on the instrumentation scope identifiers.
> ## Labels
>
> Conceptually, Elastic metric label keys correspond to [OpenTelemetry Attributes](https://opentelemetry.io/docs/reference/specification/common/#attribute).
> When exporting, agents MUST convert the attributes to labels as described in this section and MUST group metrics with the same attributes and OpenTelemetry instrumentation scope together into a single metricset.
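The grouping rule above can be sketched as computing one key per (scope, attributes) combination; metrics that share a key land in the same metricset. The key serialization below is an assumption for illustration, not how any agent actually encodes it:

```js
// Illustrative grouping key: metrics with identical attributes and the
// same instrumentation scope share a metricset. `metricsetKey` and its
// string format are hypothetical sketches, not part of the spec.
function metricsetKey(scopeName, scopeVersion, attributes) {
  const attrs = Object.keys(attributes)
    .sort() // sort keys so attribute order doesn't affect grouping
    .map((k) => `${k}=${attributes[k]}`)
    .join(',');
  return `${scopeName}@${scopeVersion}|${attrs}`;
}

console.log(metricsetKey('my-meter', '1.0.0', { region: 'eu', env: 'prod' }));
// -> "my-meter@1.0.0|env=prod,region=eu"
```

Note that sorting the attribute keys makes the grouping independent of the order in which attributes were set, which is what "same attributes" requires.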
For the OTel tracing bridge, we introduced an `otel.attributes` property in the intake v2 spec: https://github.com/elastic/apm/blob/main/specs/agents/tracing-api-otel.md#attributes-mapping. APM Server then maps known attributes to ECS and unknown attributes to labels. I wonder if we should do the same for metrics. Does APM Server currently map OTel metric attributes to ECS fields?
Maybe this isn't as relevant to metrics as it is to traces, if the metric attributes are mostly used for custom user-defined attributes.
The server maps the metadata and some known and defined metrics.
I did a first pass; great progress on this 👍.
A few minor comments below.
Co-authored-by: Gergely Kalapos <[email protected]>
specs/agents/metrics-otel.md
Outdated

> For **1.** we want to make sure that the existing user configuration is respected by our APM agents. Ideally, agents SHOULD just register an additional exporter on the existing OpenTelemetry Metrics SDK instance(s). If the agent and language capabilities allow it, the exporter registration SHOULD be done automatically without requiring code changes by the user.
>
> For **2.** agents MAY automatically register an agent-provided SDK instance to bind the user-provided OpenTelemetry API to, if this is possible in their language and does not cause too much overhead of any kind (e.g. implementation effort or agent package size).
> Agents MUST NOT override a user-provided global OpenTelemetry metrics SDK with their own SDK.
[minor] Maybe worth mentioning that this is different from the OTel bridge, where the agent overrides the values returned by `GlobalOpenTelemetry` with its own implementation. Here, with the metrics SDK, the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another one sending metrics to another backend).
> Maybe worth mentioning that this is different from the OTel bridge where the agent overrides the values returned by GlobalOpenTelemetry with its own implementation.

Isn't `GlobalOpenTelemetry` something Java-specific? It is just the Java approach of binding the OpenTelemetry API to an implementation (e.g. an SDK or our bridge).

> Here with metrics SDK the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another sending metrics to another backend)

I don't quite understand what you mean here. When we add our exporter, we don't replace existing exporters on the user-provided metrics SDK instance. Instead, both exporters live happily side by side.
With this change the APM agent adds support for using OTel Metrics in two ways:

1. The `@opentelemetry/api` package is instrumented so that usage of `.metrics.getMeterProvider(...)` will use one provided by the APM agent, if the user hasn't explicitly provided one of their own. This implementation uses a version of the OTel Metrics SDK included with the APM agent, configured to send metrics to APM Server by default. This allows a user to use the OTel Metrics API and get metrics sent to APM Server without any configuration.
2. Alternatively, the `@opentelemetry/sdk-metrics` package is instrumented so that when a user creates a `MeterProvider`, the APM agent will add a `MetricReader` that sends metrics to APM Server. This allows a user to configure metrics as they wish using the OTel Metrics SDK and then automatically get those metrics sent to APM Server.

This also adds some grouping in ".ci/tav.json" for the TAV workflow to avoid the 256-job GH Actions limit. I'm not yet sure if those will work well.

Closes: #2954
Refs: elastic/apm#691
Refs: elastic/apm#742 (spec PR)